similarity value
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Banking & Finance (0.93)
- Education (0.93)
FL-Defender: Combating Targeted Attacks in Federated Learning
Jebreel, Najeeb, Domingo-Ferrer, Josep
Federated learning (FL) enables learning a global machine learning model from local data distributed among a set of participating workers. This makes it possible i) to train more accurate models due to learning from rich joint training data, and ii) to improve privacy by not sharing the workers' local private data with others. However, the distributed nature of FL makes it vulnerable to targeted poisoning attacks that negatively impact the integrity of the learned model while, unfortunately, being difficult to detect. Existing defenses against those attacks are limited by assumptions on the workers' data distribution, may degrade the global model performance on the main task and/or are ill-suited to high-dimensional models. In this paper, we analyze targeted attacks against FL and find that the neurons in the last layer of a deep learning (DL) model that are related to the attacks exhibit a different behavior from the unrelated neurons, making the last-layer gradients valuable features for attack detection. Accordingly, we propose FL-Defender as a method to combat FL targeted attacks. It consists of i) engineering more robust discriminative features by calculating the worker-wise angle similarity for the workers' last-layer gradients, ii) compressing the resulting similarity vectors using PCA to reduce redundant information, and iii) re-weighting the workers' updates based on their deviation from the centroid of the compressed similarity vectors. Experiments on three data sets with different DL model sizes and data distributions show the effectiveness of our method at defending against label-flipping and backdoor attacks. Compared to several state-of-the-art defenses, FL-Defender achieves the lowest attack success rates, maintains the performance of the global model on the main task and causes minimal computational overhead on the server.
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
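The three steps named in the FL-Defender abstract above (worker-wise angle similarity, PCA compression, re-weighting by deviation from a centroid) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of the coordinate-wise median as the centroid, and the linear distance-to-weight mapping are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def cosine_similarity_matrix(grads):
    """Pairwise cosine similarity between workers' last-layer gradients.

    grads: array of shape (n_workers, n_params), one flattened gradient per worker.
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)  # guard against zero gradients
    return unit @ unit.T

def pca_compress(x, n_components=2):
    """Project the rows of x onto their top principal components (PCA via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def worker_weights(grads, n_components=2):
    """Hypothetical re-weighting: workers far from the centroid get lower weight."""
    sims = cosine_similarity_matrix(grads)
    compressed = pca_compress(sims, n_components)
    # Assumption: coordinate-wise median as the centroid of the compressed vectors.
    centroid = np.median(compressed, axis=0)
    dist = np.linalg.norm(compressed - centroid, axis=1)
    max_dist = dist.max()
    if max_dist == 0:
        return np.ones(len(grads))
    # Assumption: linear mapping, so the farthest worker receives weight 0.
    return 1.0 - dist / max_dist

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=20)
    benign = base + 0.01 * rng.normal(size=(8, 20))
    poisoned = -base + 0.01 * rng.normal(size=(2, 20))  # toy "attack" direction
    print(worker_weights(np.vstack([benign, poisoned])).round(2))
```

In this toy setup the two workers whose gradients point away from the majority end up far from the centroid and receive the lowest weights; a real defense would need a mapping validated against actual attacks.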
A method for outlier detection based on cluster analysis and visual expert criteria
Lara, Juan A., Lizcano, David, Rampérez, Víctor, Soriano, Javier
Outlier detection is an important problem occurring in a wide range of areas. Outliers are the outcome of fraudulent behaviour, mechanical faults, human error, or simply natural deviations. Many data mining applications perform outlier detection, often as a preliminary step in order to filter out outliers and build more representative models. In this paper, we propose an outlier detection method based on a clustering process. The aim behind the proposal outlined in this paper is to overcome the specificity of many existing outlier detection techniques that fail to take into account the inherent dispersion of domain objects. The outlier detection method is based on four criteria designed to represent how human beings (experts in each domain) visually identify outliers within a set of objects after analysing the clusters. This has an advantage over other clustering-based outlier detection techniques that are founded on a purely numerical analysis of clusters. Our proposal has been evaluated, with satisfactory results, on data (particularly time series) from two different domains: stabilometry, a branch of medicine studying balance-related functions in human beings, and electroencephalography (EEG), a neurological exploration used to diagnose nervous system disorders. To validate the proposed method, we studied its outlier detection performance and its efficiency in terms of runtime. The results of regression analyses confirm that our proposal is useful for detecting outlier data in different domains, with a false positive rate of less than 2% and a reliability greater than 99%.
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States > New York (0.04)
- North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
- Research Report > Experimental Study (0.87)
- Research Report > New Finding (0.66)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Diagnostic Medicine (0.88)
Evaluating NLP Embedding Models for Handling Science-Specific Symbolic Expressions in Student Texts
Bleckmann, Tom, Tschisgale, Paul
In recent years, natural language processing (NLP) has become integral to educational data mining, particularly in the analysis of student-generated language products. For research and assessment purposes, so-called embedding models are typically employed to generate numeric representations of text that capture its semantic content for use in subsequent quantitative analyses. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing research studies and practical applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased research findings and diminished performance of practical applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: 1) similarity-based analyses and 2) integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Overall, this study underscores the importance for educational data mining researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions. The code and (partial) data are available at https://doi.org/10.17605/OSF.IO/6XQVG.
- North America > United States > New Jersey (0.04)
- North America > United States > California > Sacramento County > Sacramento (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Education > Curriculum > Subject-Specific Education (0.47)
- Information Technology > Security & Privacy (0.46)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
SINAI at eRisk@CLEF 2022: Approaching Early Detection of Gambling and Eating Disorders with Natural Language Processing
Marmol-Romero, Alba Maria, Jimenez-Zafra, Salud Maria, Plaza-del-Arco, Flor Miriam, Molina-Gonzalez, M. Dolores, Martin-Valdivia, Maria-Teresa, Montejo-Raez, Arturo
This paper describes the participation of the SINAI team in the eRisk@CLEF lab. Specifically, two of the proposed tasks have been addressed: i) Task 1 on the early detection of signs of pathological gambling, and ii) Task 3 on measuring the severity of the signs of eating disorders. The approach presented in Task 1 is based on the use of sentence embeddings from Transformers with features related to volumetry, lexical diversity, complexity metrics, and emotion-related scores, while the approach for Task 3 is based on text similarity estimation using contextualized word embeddings from Transformers. In Task 1, our team has been ranked in second position, with an F1 score of 0.808, out of 41 participant submissions. In Task 3, our team also placed second out of a total of 3 participating teams.
- Europe > Spain > Andalusia (0.04)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis
Simon, Dominic, Ewetz, Rickard
Large Language Models (LLMs) require lightweight avenues of updating stored information that has fallen out of date. Knowledge Editing (KE) approaches have been successful in updating model knowledge for simple factual queries but struggle with handling tasks that require compositional reasoning such as multi-hop question answering (MQA). We observe that existing knowledge editors leverage decompositional techniques that result in illogical reasoning processes. In this paper, we propose a knowledge editor for MQA based on semantic analysis called CHECK. Our framework is based on insights from an analogy between compilers and reasoning using LLMs. Similar to how source code is first compiled before being executed, we propose to semantically analyze reasoning chains before executing the chains to answer questions. Reasoning chains with semantic errors are revised to ensure consistency through logic optimization and re-prompting the LLM at a higher temperature. We evaluate the effectiveness of CHECK against five state-of-the-art frameworks on four datasets and achieve an average improvement of 22.8% in MQA accuracy.
- North America > United States (0.46)
- Asia > Japan (0.05)
- Asia > Thailand > Bangkok > Bangkok (0.04)
On Self-improving Token Embeddings
Kubek, Mario M., Pokharel, Shiraj, Böhme, Thomas, McDaniel, Emma L., Unger, Herwig, Mikler, Armin R.
This article introduces a novel and fast method for refining pre-trained static word or, more generally, token embeddings. By incorporating the embeddings of neighboring tokens in text corpora, it continuously updates the representation of each token, including those without pre-assigned embeddings. This approach effectively addresses the out-of-vocabulary problem, too. Operating independently of large language models and shallow neural networks, it enables versatile applications such as corpus exploration, conceptual search, and word sense disambiguation. The method is designed to enhance token representations within topically homogeneous corpora, where the vocabulary is restricted to a specific domain, resulting in more meaningful embeddings compared to general-purpose pre-trained vectors. As an example, the methodology is applied to explore storm events and their impacts on infrastructure and communities using narratives from a subset of the NOAA Storm Events database. The article also demonstrates how the approach improves the representation of storm-related terms over time, providing valuable insights into the evolving nature of disaster narratives.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
- Information Technology > Information Management > Search (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications (1.00)
Adaptive Semantic Prompt Caching with VectorQ
Schroeder, Luis Gaspar, Liu, Shu, Cuadron, Alejandro, Zhao, Mark, Krusche, Stephan, Kemper, Alfons, Zaharia, Matei, Gonzalez, Joseph E.
Semantic prompt caches reduce the latency and cost of large language model (LLM) inference by reusing cached LLM-generated responses for semantically similar prompts. Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache. Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit. We show that this one-size-fits-all threshold is insufficient across different prompts. We propose VectorQ, a framework to learn embedding-specific threshold regions that adapt to the complexity and uncertainty of an embedding. Through evaluations on a combination of four diverse datasets, we show that VectorQ consistently outperforms state-of-the-art systems across all static thresholds, achieving up to 12x increases in cache hit rate and error rate reductions up to 92%.
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
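The static-threshold baseline that the VectorQ abstract above argues against can be sketched in a few lines. This is not VectorQ itself, and the class name, the default threshold of 0.9, and the use of cosine similarity are assumptions for illustration only; the abstract does not name a specific metric or threshold value. Embeddings are assumed to be non-zero vectors.

```python
import numpy as np

class StaticThresholdCache:
    """Hypothetical semantic prompt cache with one global similarity threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.embeddings = []  # unit-normalized prompt embeddings
        self.responses = []   # cached responses, aligned with embeddings

    def add(self, embedding, response):
        v = np.asarray(embedding, dtype=float)
        self.embeddings.append(v / np.linalg.norm(v))
        self.responses.append(response)

    def lookup(self, embedding):
        """Return the nearest neighbor's response on a cache hit, else None."""
        if not self.embeddings:
            return None
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.embeddings) @ q  # cosine similarity to each entry
        best = int(np.argmax(sims))
        # The same threshold is applied to every prompt, regardless of content.
        return self.responses[best] if sims[best] >= self.threshold else None

if __name__ == "__main__":
    cache = StaticThresholdCache(threshold=0.9)
    cache.add([1.0, 0.0], "cached answer")
    print(cache.lookup([0.99, 0.1]))  # similar prompt: cache hit
    print(cache.lookup([0.0, 1.0]))   # dissimilar prompt: cache miss
```

Because the threshold is fixed, a value strict enough to avoid wrong hits for one kind of prompt may reject valid hits for another, which is the limitation the abstract describes.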